Open-Domain Question Answering (ODQA) requires models to answer factoid questions with no context given. The common approach to this task is to train models on a large-scale annotated dataset to retrieve related documents and generate answers based on these documents. In this paper, we show that the ODQA architecture can be dramatically simplified by treating Large Language Models (LLMs) as a knowledge corpus, and we propose a Self-Prompting framework for LLMs to perform ODQA that eliminates the need for training data and an external knowledge corpus. Concretely, we first generate multiple pseudo QA pairs, together with background passages and one-sentence explanations for these pairs, by prompting LLMs step by step, and then leverage the generated QA pairs for in-context learning. Experimental results show that our method surpasses previous state-of-the-art methods by an average of +8.8 EM on three widely used ODQA datasets, and even achieves performance comparable to several retrieval-augmented fine-tuned models.
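The abstract above reports gains in EM (Exact Match) without defining the metric. As a minimal sketch, assuming the SQuAD-style answer normalization that is commonly used for ODQA evaluation (the abstract does not say which normalization the authors apply), EM could be computed as follows. The function names `normalize_answer`, `exact_match`, and `average_em` are illustrative, not taken from the paper.

```python
import re
import string


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace.

    Assumption: SQuAD-style normalization; the paper may use a variant.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: list[str]) -> int:
    """Return 1 if the normalized prediction equals any normalized gold answer."""
    pred = normalize_answer(prediction)
    return int(any(pred == normalize_answer(gold) for gold in gold_answers))


def average_em(predictions: list[str], references: list[list[str]]) -> float:
    """Average EM over a dataset, expressed on a 0-100 scale."""
    scores = [exact_match(p, refs) for p, refs in zip(predictions, references)]
    return 100.0 * sum(scores) / len(scores)
```

Under this sketch, a prediction counts as correct only when it matches a reference answer exactly after normalization, so a "+8.8 EM" gain would correspond to 8.8 more questions per hundred answered with an exactly matching string.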
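Image Transformers have recently achieved significant progress in natural image understanding, using either supervised (ViT, DeiT, etc.) or self-supervised (BEiT, MAE, etc.) pre-training techniques. In this paper, we propose \textbf{DiT}, a self-supervised pre-trained \textbf{D}ocument \textbf{I}mage \textbf{T}ransformer model that uses large-scale unlabeled text images for Document AI tasks. Self-supervision is essential here because no supervised counterparts exist, owing to the lack of human-labeled document images. We use DiT as the backbone network in a variety of vision-based Document AI tasks, including document image classification, document layout analysis, table detection, and text detection for OCR. Experimental results show that the self-supervised pre-trained DiT model achieves new state-of-the-art results on these downstream tasks, e.g. document image classification (91.11 $\rightarrow$ 92.69), document layout analysis (91.0 $\rightarrow$ 94.9), table detection (94.23 $\rightarrow$ 96.55), and text detection for OCR (93.07 $\rightarrow$ 94.29). The code and pre-trained models are publicly available at \url{https://aka.ms/msdit}.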
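Trajectory prediction faces the difficulty of capturing the multi-modal nature of future dynamics with both diversity and accuracy. In this paper, we present a distribution discrimination (DisDis) method that predicts personalized motion patterns by distinguishing latent distributions. Motivated by the observation that each person's motion pattern is personalized by his or her habits, DisDis learns a latent distribution to represent different motion patterns and optimizes it through contrastive discrimination. This distribution discrimination encourages the latent distributions to be more discriminative. Our method can be integrated with existing multi-modal stochastic prediction models as a plug-and-play module to learn a more discriminative latent distribution. To evaluate the latent distribution, we further propose a new metric, the probability cumulative minimum distance (PCMD) curve, which cumulatively calculates the minimum distance over the sorted probabilities. Experimental results on the ETH and UCY datasets show the effectiveness of our method.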
Emotion-cause pair extraction (ECPE) aims to extract emotion clauses and their corresponding cause clauses, a task that has recently received growing attention. Previous methods encode features sequentially in a specified order: they first encode the emotion and cause features for clause extraction and then combine them for pair extraction. This leads to an imbalance in inter-task feature interaction, where features extracted later have no direct contact with those extracted earlier. To address this issue, we propose a novel Pair-Based Joint Encoding (PBJE) network, which generates pair and clause features simultaneously in a joint feature encoding manner to model the causal relationship between clauses. PBJE can balance the information flow among emotion clauses, cause clauses, and pairs. From a multi-relational perspective, we construct a heterogeneous undirected graph and apply the Relational Graph Convolutional Network (RGCN) to capture the various relationships between clauses and the relationships between pairs and clauses. Experimental results show that PBJE achieves state-of-the-art performance on the Chinese benchmark corpus.
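In recent years, segmentation methods based on deep convolutional neural networks (CNNs) have achieved state-of-the-art results on many medical analysis tasks. However, most of these methods improve performance by optimizing the structure of U-Net or adding new functional modules to it, ignoring the complementarity and fusion of coarse-grained and fine-grained semantic information. To address this problem, we propose a medical image segmentation framework called the progressive learning network (PL-Net), which comprises internal progressive learning (IPL) and external progressive learning (EPL). PL-Net has the following advantages: (1) IPL divides feature extraction into two "steps", which can mix receptive fields of different sizes and capture semantic information from coarse to fine granularity without introducing additional parameters; (2) EPL divides the training process into two "stages" to optimize the parameters, fusing coarse-grained information in the earlier stage and fine-grained information in the later stage. We evaluate our method on different medical image analysis tasks, and the results show that the segmentation performance of PL-Net is better than that of state-of-the-art methods based on U-Net and its variants.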
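In recent years, with the growing demand for public safety and the rapid development of intelligent surveillance networks, person re-identification (Re-ID) has become one of the hot research topics in computer vision. The main research goal of person Re-ID is to retrieve persons with the same identity across different cameras. However, traditional person Re-ID methods require manual labeling of person targets, which incurs substantial labor cost. With the widespread application of deep neural networks, many deep learning-based person Re-ID methods have emerged. This paper therefore aims to help researchers understand the latest research results and future trends in the field. First, we summarize several recently published person Re-ID surveys and supplement them with the latest research methods in order to systematically classify deep learning-based person Re-ID methods. Second, we propose a multi-dimensional taxonomy that divides deep learning-based person Re-ID methods into four categories according to metric learning and representation learning: deep metric learning, local feature learning, generative adversarial learning, and sequence feature learning. Furthermore, we subdivide these four categories according to their methodologies and motivations, and discuss the advantages and limitations of some of the subcategories. Finally, we discuss several challenges and possible research directions for person Re-ID.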
In this paper, we propose a robust 3D detector, named Cross Modal Transformer (CMT), for end-to-end 3D multi-modal detection. Without explicit view transformation, CMT takes image and point cloud tokens as inputs and directly outputs accurate 3D bounding boxes. The spatial alignment of multi-modal tokens is performed implicitly, by encoding the 3D points into multi-modal features. The core design of CMT is quite simple, yet its performance is impressive: CMT obtains 73.0% NDS on the nuScenes benchmark. Moreover, CMT remains strongly robust even when the LiDAR input is missing. Code will be released at https://github.com/junjie18/CMT.
Dataset distillation has emerged as a prominent technique to improve data efficiency when training machine learning models. It encapsulates the knowledge from a large dataset into a smaller synthetic dataset. A model trained on this smaller distilled dataset can attain comparable performance to a model trained on the original training dataset. However, the existing dataset distillation techniques mainly aim at achieving the best trade-off between resource usage efficiency and model utility. The security risks stemming from them have not been explored. This study performs the first backdoor attack against the models trained on the data distilled by dataset distillation models in the image domain. Concretely, we inject triggers into the synthetic data during the distillation procedure rather than during the model training stage, where all previous attacks are performed. We propose two types of backdoor attacks, namely NAIVEATTACK and DOORPING. NAIVEATTACK simply adds triggers to the raw data at the initial distillation phase, while DOORPING iteratively updates the triggers during the entire distillation procedure. We conduct extensive evaluations on multiple datasets, architectures, and dataset distillation techniques. Empirical evaluation shows that NAIVEATTACK achieves decent attack success rate (ASR) scores in some cases, while DOORPING reaches higher ASR scores (close to 1.0) in all cases. Furthermore, we conduct a comprehensive ablation study to analyze the factors that may affect the attack performance. Finally, we evaluate multiple defense mechanisms against our backdoor attacks and show that our attacks can practically circumvent these defense mechanisms.
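To make the NAIVEATTACK description above more concrete, here is a minimal sketch of adding a fixed trigger to raw data before distillation begins. It assumes a BadNets-style white square patch in the bottom-right corner and relabeling of the poisoned samples to an attacker-chosen class; the patch shape, position, poison fraction, and the function name `stamp_trigger` are all illustrative assumptions, since the abstract does not specify them. The distillation step itself and the iterative trigger updates of DOORPING are not shown.

```python
import numpy as np


def stamp_trigger(images, labels, target_label,
                  poison_fraction=0.1, patch_size=3, seed=0):
    """Return poisoned copies of (images, labels) plus the poisoned indices.

    images: float array of shape (N, H, W, C) with values in [0, 1].
    labels: integer array of shape (N,).
    Assumption: the trigger is a white square in the bottom-right corner.
    """
    rng = np.random.default_rng(seed)
    images = images.copy()   # leave the caller's arrays untouched
    labels = labels.copy()
    n_poison = int(round(poison_fraction * len(images)))
    idx = rng.choice(len(images), size=n_poison, replace=False)
    # Overwrite the corner patch of each selected image with the trigger.
    images[idx, -patch_size:, -patch_size:, :] = 1.0
    # Relabel the selected samples to the attacker's target class.
    labels[idx] = target_label
    return images, labels, idx
```

In this sketch the poisoned arrays would then be passed to whatever distillation procedure is in use, so the trigger is carried into the synthetic data rather than being injected at model-training time.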
Automatic music generation with artificial intelligence typically requires a large amount of data, which is hard to obtain for many less common genres and musical instruments. To tackle this issue, we present ongoing work and preliminary findings on the possibility for deep models to transfer knowledge from language to music, by fine-tuning large language models pre-trained on a massive text corpus on only hundreds of MIDI files of drum performances. We show that by doing so, one of the largest state-of-the-art models (GPT3) is capable of generating reasonable drum grooves, while models that are not pre-trained (Transformer) show no such ability beyond naive repetition. Evaluating generated music is a challenging task, and evaluating drum grooves, for which there is little precedent in the literature, is even more so. Hence, we propose a tailored structural evaluation method and analyze drum grooves produced by GPT3 compared to those played by human professionals, exposing the strengths and weaknesses of such generation by language-to-music transfer. Our findings suggest that language-to-music transfer learning with large language models is viable and promising.
Few Shot Instance Segmentation (FSIS) requires models to detect and segment novel classes given only a few support examples. In this work, we explore a simple yet unified solution for FSIS as well as its incremental variants, and introduce a new framework named Reference Twice (RefT) to fully explore the relationship between support and query features based on a Transformer-like framework. Our key insights are twofold. First, with the aid of support masks, we can generate dynamic class centers more appropriately to re-weight query features. Second, we find that support object queries have already encoded key factors after base training. In this way, the query features can be enhanced twice, at the feature level and at the instance level. In particular, we first design a mask-based dynamic weighting module to enhance support features and then propose to link object queries for better calibration via cross-attention. After these steps, performance on the novel classes improves significantly over our strong baseline. Additionally, our new framework can be easily extended to incremental FSIS with minor modification. When benchmarking on the COCO dataset for the FSIS, gFSIS, and iFSIS settings, our method achieves competitive performance compared to existing approaches across different shots, e.g., we boost nAP by a noticeable +8.2/+9.4 over the current state-of-the-art FSIS method for 10/30-shot. We further demonstrate the superiority of our approach on Few Shot Object Detection. Code and model will be available.
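The first insight above, generating class centers from support masks and using them to re-weight query features, can be illustrated with a small sketch. It assumes masked average pooling for the class center and a plain channel-wise multiplication for the re-weighting; both are common choices in few-shot segmentation but are assumptions here, not the RefT module itself, and the function names are hypothetical.

```python
import numpy as np


def masked_average_pool(features, mask, eps=1e-6):
    """Average support features over the pixels selected by a binary mask.

    features: array of shape (C, H, W); mask: array of shape (H, W).
    Returns a (C,) class center. Assumption: masked average pooling.
    """
    weights = mask.astype(features.dtype)
    return (features * weights).sum(axis=(1, 2)) / (weights.sum() + eps)


def reweight_query(query_features, class_center):
    """Scale each query channel by the matching class-center value.

    Assumption: simple channel-wise multiplication as the re-weighting rule.
    """
    return query_features * class_center[:, None, None]
```

Because the center is pooled only over masked pixels, background regions of the support image do not dilute it, which is the intuition behind using the support masks in the first place.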